ABSTRACT

Publishing person-specific transactions in an anonymous 
form is increasingly required by organizations. Recent 
approaches ensure that potentially identifying information 
(e.g., a set of diagnosis codes) cannot be used to link pub- 
lished transactions to persons' identities, but all are lim- 
ited in application because they incorporate coarse privacy 
requirements (e.g., protecting a certain set of m diagnosis 
codes requires protecting all m-sized sets), do not integrate 
utility requirements, and tend to explore a small portion 
of the solution space. In this paper, we propose a more 
general framework for anonymizing transactional data un- 
der specific privacy and utility requirements. We model 
such requirements as constraints, investigate how these con- 
straints can be specified, and propose COAT (COnstraint- 
based Anonymization of Transactions), an algorithm that 
anonymizes transactions using a flexible hierarchy-free gen- 
eralization scheme to meet the specified constraints. Exper- 
iments with benchmark datasets verify that COAT signifi- 
cantly outperforms the current state-of-the-art algorithm in 
terms of data utility, while being comparable in terms of 
efficiency. The effectiveness of our approach is also demon- 
strated in a real-world scenario, which requires disseminat- 
ing a private, patient-specific transactional dataset in a way 
that preserves both privacy and utility in intended studies. 

1. INTRODUCTION 

Organizations in various domains, ranging from health- 
care to government, increasingly share person-specific data, 
devoid of explicit identifiers (e.g., names), to enable research 
and comply with regulations. For instance, the U.S. Na- 
tional Institutes of Health (NIH) recently mandated that 
NIH-sponsored investigators disclose data collected or stud- 
ied in a manner that is "free of identifiers that could lead to 
deductive disclosure of the identity of individual subjects" 
[?]. Numerous studies demonstrate that de-identification 
(i.e., the removal of explicit identifiers) is insufficient for 
privacy protection of transactional data (i.e., data in which 



a set of items corresponds to an individual) [13] [25] [10]. This
is because published transactions can disclose the identity 
of an individual associated with a transaction, if an attacker 
knows some of the items this individual is associated with. 
Imagine for example that Alice was diagnosed with the diseases
contained in the first transaction of Fig. 1(a) and told
her neighbor Bob that she suffers from a, b and c, which
are relatively common. Publishing a de-identified version
of the data of Fig. 1(a) allows Bob to find out that the
first transaction corresponds to Alice, since there is only one 
transaction containing a, b and c in this dataset. This prob- 
lem, referred to as identity disclosure, must be addressed to 
comply with regulations and to protect individuals' privacy. 
Having identified Alice's transaction, for example, Bob can 
infer that Alice also suffers from the diseases d, e, f, g and h.

1.1 Motivation 

To prevent identity disclosure, portions of transactions 
that are potentially linkable to identifying information need 
to be explicitly specified and protected prior to data release. 
This involves formulating a set of privacy constraints, which 
are satisfied by transforming data so that each individual 
is linked to a sufficiently large number of transactions with 
respect to these constraints. This process achieves privacy 
because an attacker must distinguish an individual's real 
transaction among the transformed ones to identify him/her. 
However, data transformation may harm data utility when 
usability requirements are unaccounted for, resulting in re- 
leased data that is subpar for intended applications. Thus, 
it is essential to balance privacy constraints with utility constraints
to ensure meaningful analysis. To show the importance
of both types of constraints, consider Example 1.

Figure 1: Example dataset and constraints.

Example 1. Imagine that a hospital needs to publish the
dataset of Fig. 1(a), where each transaction corresponds to
a patient (patients' names are not released) and consists of
a set of diagnosis codes (items). Certain item combinations
(itemsets), such as the combinations of diagnosis codes abc
and defgh of Fig. 1(b), are regarded as potentially linkable
and must be associated with at least 5 transactions to
prevent identity disclosure. At the same time, the published
data must be able to support a study in which the number
of patients diagnosed with cold, denoted with c, needs to be
accurately determined. These requirements can be modeled
via the privacy constraints in Fig. 1(b) and a utility constraint
{c}. An anonymization that satisfies them is given
in Fig. 2(c). Observe that each patient can be linked to
no fewer than 5 transactions using the combinations abc or
defgh, while the number of patients suffering from cold can
still be accurately computed after anonymization.

Figure 2: Three possible anonymizations of the dataset of Fig. 1(a)

1.2 Limitations of Existing Methodologies 

Preventing identity disclosure in transactional data was
investigated recently [20, 25]; however, existing approaches
inadequately deal with the scenario of Example 1 for several
reasons. First, they support a limited class of privacy
requirements. While they are effective at protecting all itemsets
comprised of a certain number of items (i.e., itemsets
of certain size), potentially linkable itemsets may involve
certain items only and vary in size. For instance, in the
presence of the privacy constraints in Example 1, the approaches
of [20] and [25] would protect all 56 combinations
of 5 diagnosis codes (e.g., abcde, bcdef, etc.). Unnecessarily
protecting itemsets significantly distorts data because the 
number of itemsets that require protection rapidly increases 
with their size. 

Second, prior approaches neglect specific data utility requirements.
Thus, they do not guarantee generating practically
useful solutions for environments where usability
is based on well-defined policies, such as in epidemiology
(where combinations of diagnosis codes form syndromes)
[12]. For instance, when applied to the dataset of Fig. 1(a),
[20] and [25] produce the anonymizations of Figs. 2(a)
and 2(b) (respectively). These anonymizations do not support
the study of Example 1 because they do not allow the
number of patients suffering from cold to be accurately computed,
as a result of violating the utility constraint {c}.

Third, the existing literature considers only a small num- 
ber of possible transformations to meet privacy constraints. 
For example, [20] protects a, b and c by generalizing them 
to their closest common ascendant (a, b, c) according to the 
hierarchy of Fig. 3, while [25] does so by eliminating them
from the released dataset. This incurs excessive information
loss, but, as we show in this paper, is unnecessary because
items can be generalized together in a hierarchy-free manner
to reduce the amount of information loss incurred.




Figure 3: Hierarchy for the dataset of Fig. 1(a) 



1.3 Contributions 

This paper proposes an innovative approach to anonymize 
transactional data under privacy and utility constraints. 
Given a dataset and a set of constraints, our approach pre- 
vents identity disclosure by ensuring that each transaction is 
indistinguishable from at least k - 1 other transactions with
respect to privacy constraints, while satisfying utility constraints.
For instance, when applied to anonymize the data
in Fig. 1(a) using the constraints in Figs. 1(b) and 1(c), our
approach generates the anonymization of Fig. 2(c), which
satisfies the imposed constraints.

Our work makes the following specific contributions: 

First, we propose a novel constraint specification model that 
enables data owners to express detailed privacy and utility 
requirements. We introduce privacy constraints that allow 
protecting the required itemsets only, thereby reducing in- 
formation loss, and utility constraints that ensure the utility 
of anonymized data in practice. Acknowledging that constraint
specification may be difficult in the absence of specific domain
knowledge, we also propose an algorithm to extract
privacy constraints from the data and a recipe for specify- 
ing utility constraints. 

Second, we propose an item generalization model that 
eliminates the requirement for hierarchies. This is impor- 
tant because many domains do not fit into rigid hierarchies 
[6, 15]. In doing so, our framework can produce fine-grained
anonymizations with substantially better utility than that
of [20] and [25]. For example, to meet the privacy constraint
p_1 (Fig. 1(b)), our solution generalizes a and b to (a, b) and
leaves c intact, retaining more information than [20], which
releases (a, b, c), and [25], which eliminates a, b and c.

Third, we develop COnstraint-based Anonymization of
Transactions (COAT), an algorithm that iteratively selects
privacy constraints and transforms data to satisfy them. For 
each privacy constraint, COAT generalizes items in accor- 
dance with the specified utility constraints and attempts to 
minimize information loss. When a privacy constraint can- 
not be satisfied through generalization, COAT suppresses 
the least number of items required to meet this constraint. 

Fourth, we investigate the effectiveness of our approach 
through experiments on widely-used benchmark datasets 
and a case study on patient-specific data extracted from the 
Electronic Medical Record system [18] of Vanderbilt Univer- 
sity Medical Center, a large healthcare provider in the U.S. 
Our results verify that the proposed methodology is able 
to anonymize transactions under various privacy and utility 
constraints with less information loss than the state-of-the- 
art method [20] and to generate high-quality anonymizations 
in a real-world data dissemination scenario. 

1.4 Paper Organization 

The rest of the paper is organized as follows. Related work
is reviewed in Section 2. We formally define our constraint
specification model and the problem of anonymizing transactions
under constraints in Section 3. COAT is presented
in Section 4. Section 5 presents methods for specifying constraints.
Sections 6 and 7 report experimental results and
a case study of applying COAT, respectively. Finally, we
conclude the paper in Section 8.

2. RELATED WORK 

The problem of identity disclosure in relational data publishing
has been studied extensively [4, 8, 9, 16, 19]. A well-established
principle that can prevent this type of threat
is k-anonymity [16, 19]. A relational table is k-anonymous
when each record is indistinguishable from at least k - 1 others
with respect to potentially identifying attributes (termed
quasi-identifiers or QIDs). k-anonymity can be achieved by
generalization, a process in which QID values are replaced
by more general ones specified by a generalization (recoding)
model, or via suppression, a technique that removes
values or records from anonymized data [8]. In this work,
we anonymize transactions to thwart identity disclosure by
applying item generalization and suppression. Beyond identity
disclosure is another threat in relational data publishing
known as attribute disclosure (i.e., the inference of an individual's
sensitive values), which can be guarded against by
several principles, such as l-diversity [11]. We note that our
approach can be extended to prevent attribute disclosure as
well, but this is beyond the scope of this paper.

Privacy-preserving publication of transactions was recently
investigated [3][20][25]. First, in [20], $k^m$-anonymity
was proposed to prevent attackers with the knowledge of at
most m items from linking an identified individual to fewer
than k published transactions. The authors of [20] designed
three algorithms to enforce $k^m$-anonymity, but the Apriori
algorithm is the only one that is sufficiently scalable for
use in practice. It operates in a bottom-up fashion, beginning
with itemsets comprised of one item and subsequently
considering incrementally larger itemsets. In each iteration,
$k^m$-anonymity is enforced using a hierarchy-based generalization
model. Second, [25] proposed (h, k, p)-coherence, a
privacy principle that addresses both identity and attribute
disclosure. This principle assumes a fixed classification of
items into potentially linkable and sensitive, treats potentially
linkable items similarly to $k^m$-anonymity, and additionally
limits the probability of inferring sensitive items.
Following [20], we do not adopt such a classification, and allow
any item to be treated as potentially linkable for the remaining
(sensitive) items. To satisfy (h, k, p)-coherence, [25]
proposed an algorithm that discovers all unprotected itemsets
of minimal size and protects them by iteratively suppressing
the item contained in the greatest number of those
itemsets. The primary differences between our work and
the approaches of [20] and [25] were discussed in the Introduction.
Finally, [5] developed a method that eliminates attribute
disclosure based on bucketization [23] and l-diversity.
Our work is orthogonal to [5]; we do not aim to thwart attribute
disclosure, but rather apply item generalization and
suppression to guard against identity disclosure.

Preserving privacy has also been considered in contexts 
related to knowledge sharing, where the goal is to prevent 
the inference of rules or patterns [2] [21]. Our method is 
fundamentally different from this line of research, as we aim 
to publish data that prevents the disclosure of individuals' 
identities instead. 

3. BACKGROUND AND PROBLEM FORMULATION

Let $\mathcal{I} = \{i_1, \ldots, i_M\}$ be a finite set of literals, called items.
Any subset $I \subseteq \mathcal{I}$ is called an itemset over $\mathcal{I}$, and is
represented as the concatenation of the items it contains. An
itemset that has m items, or equivalently a size of m, is called
an m-itemset. A dataset $\mathcal{D} = \{T_1, \ldots, T_N\}$ is a set of N
transactions. Each transaction $T_n$, $n = 1, \ldots, N$, over $\mathcal{I}$
corresponds to a unique individual and is a pair $T_n = (tid, I)$,
where I is the itemset and tid is a unique identifier. A
transaction $T_n = (tid, J)$ supports an itemset I over $\mathcal{I}$ if $I \subseteq J$.
Given an itemset I over $\mathcal{I}$ in $\mathcal{D}$, we use $sup(I, \mathcal{D})$ to
represent the number of transactions $T_n \in \mathcal{D}$ that support I.
This set of transactions, called supporting transactions of I
in $\mathcal{D}$, is denoted as $\mathcal{D}_I$.
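
The notion of support can be made concrete with a few lines of Python. This is only an illustrative sketch: the names `supporting_transactions`, `sup` and the dataset `toy` are hypothetical, the toy dataset is not the one of Fig. 1(a), and each transaction is reduced to its itemset (the tid is left implicit).

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]  # the itemset of one transaction; the tid is left implicit


def supporting_transactions(itemset: Iterable[str],
                            dataset: List[Transaction]) -> List[Transaction]:
    """Return the transactions whose itemset contains every item of `itemset`."""
    wanted = frozenset(itemset)
    return [t for t in dataset if wanted <= t]


def sup(itemset: Iterable[str], dataset: List[Transaction]) -> int:
    """Number of transactions in `dataset` that support `itemset`."""
    return len(supporting_transactions(itemset, dataset))


# Hypothetical toy dataset (an assumption, not the dataset of Fig. 1(a)).
toy = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("c")]
print(sup("a", toy), sup("ab", toy), sup("abc", toy), sup("d", toy))
```

Representing transactions as frozensets makes the test "T supports I" a plain subset check.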

3.1 Set-based Anonymization 

We propose a set-based anonymization model for transac- 
tional data, formally defined as follows: 

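Definition 3.1. (Set-Based Anonymization). A set-based
anonymization of $\mathcal{I}$ is a set $\tilde{\mathcal{I}} = \{\tilde{i}_1, \ldots, \tilde{i}_{\tilde{M}}\}$ with the
following properties: (1) each item in $\mathcal{I}$ is mapped to a
unique item $\tilde{i}_m \in \tilde{\mathcal{I}}$, $m \in [1, \tilde{M}]$, that is a subset of $\mathcal{I}$, using
an anonymization function $\Phi : \mathcal{I} \to \tilde{\mathcal{I}}$, (2) $\bigcup_{m=1}^{\tilde{M}} \tilde{i}_m = \mathcal{I} \setminus S$,
where S is the set of items mapped to the empty subset of $\mathcal{I}$,
and (3) $\tilde{i}_r \cap \tilde{i}_s = \emptyset$, for any $\tilde{i}_r, \tilde{i}_s \in \tilde{\mathcal{I}}$, $r \neq s$.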

$\tilde{i}$ is a generalized item when it contains at least one $i_r \in \mathcal{I}$
that is mapped to a non-empty subset of $\mathcal{I}$. We use the
notation $\tilde{i} = (i_1, \ldots, i_m)$ to refer to its elements (items) from
$\mathcal{I}$. Any item from $\mathcal{I}$ that is mapped to the empty subset of
$\mathcal{I}$, denoted as ( ), is called suppressed, and is contained in the
set S. An example of a set-based anonymization is shown in
Fig. 4. Notice that items a and b are mapped to the same
generalized item (a, b), whereas item d is suppressed.
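
As a rough illustration of Definition 3.1, the sketch below checks whether a candidate mapping is a valid set-based anonymization. The function name `is_set_based_anonymization` and the dictionary encoding of the mapping are assumptions made for this example. Only the treatment of a, b and d follows the text; mapping the remaining items to themselves is a guess about Fig. 4, which is not reproduced here.

```python
from typing import Dict, FrozenSet, Set


def is_set_based_anonymization(items: Set[str],
                               phi: Dict[str, FrozenSet[str]]) -> bool:
    """Check the properties of Definition 3.1 for a mapping phi (item -> generalized item)."""
    if set(phi) != items:
        return False  # every item must be mapped exactly once
    generalized = {g for g in phi.values() if g}        # non-empty generalized items
    suppressed = {i for i, g in phi.items() if not g}   # the set S
    # (1) a non-suppressed item must be an element of the generalized item it maps to
    if any(i not in phi[i] for i in items - suppressed):
        return False
    # (2) the generalized items together cover exactly the items outside S
    covered = set().union(*generalized)
    if covered != items - suppressed:
        return False
    # (3) generalized items are pairwise disjoint (no item is counted twice)
    return sum(len(g) for g in generalized) == len(covered)


items = set("abcdefgh")
ab = frozenset("ab")
phi = {i: frozenset(i) for i in items}                    # assumption: other items unchanged
phi.update({"a": ab, "b": ab, "d": frozenset()})          # a, b -> (a, b); d suppressed
print(is_set_based_anonymization(items, phi))
```

A mapping in which a goes to (a, b) while b stays on its own fails the disjointness check, since b would then belong to two generalized items.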

The set-based anonymization model is very flexible because
it does not force any items to be generalized together,
as formally shown in Corollary 3.1. This is different from
the full-subtree generalization model [7] adopted in [20],
which forces all siblings of an original (leaf-level) item to
be mapped to an intermediate node in the hierarchy when
this item is generalized to the intermediate node.

Figure 4: An example of set-based anonymization.

Corollary 3.1. In the set-based anonymization model,
mapping an item $i_r \in \mathcal{I}$ to a generalized item $\tilde{i}$ does not
force any other item $i_s \in \mathcal{I}$ to be mapped to $\tilde{i}$.



Additionally, as explained in Corollary 3.2, the set-based
anonymization model contains the generalization model 
used in [20] as a special case. Thus, our model enables ex- 
ploring a much larger set of possible anonymizations, which 
have the potential to incur less information loss. 

Corollary 3.2. The full-subtree recoding model is a
special case of the set-based anonymization model, where
each $\tilde{i}_m$, $m \in [1, \tilde{M}]$, is mapped to an intermediate node
of the considered hierarchy.

Our anonymization model transforms a dataset $\mathcal{D}$ into a
new dataset $\tilde{\mathcal{D}}$ that helps prevent identity disclosure, since
the number of transactions of $\tilde{\mathcal{D}}$ that can be associated with
an individual is increased, as proven in Theorem 3.1.

Theorem 3.1. (Generalization Principle). Given
two items $i_r, i_s$ that appear in transactions of $\mathcal{D}$ and are
mapped to the same generalized item $\tilde{i}$ after anonymizing $\mathcal{D}$
to $\tilde{\mathcal{D}}$, and an itemset $i_r i_s$, it holds that

$sup(\tilde{i}, \tilde{\mathcal{D}}) = sup(i_r, \mathcal{D}) + sup(i_s, \mathcal{D}) - sup(i_r i_s, \mathcal{D})$

Proof. The proof follows directly from the fact that the
items $i_r$ and $i_s$, and the itemset $i_r i_s$, are mapped to a common
literal $\tilde{i}$ of $\tilde{\mathcal{D}}$ in all transactions of $\mathcal{D}$ that support $i_r$, $i_s$
or $i_r i_s$. □



We illustrate Theorem 3.1 using Figs. 1(a) and 2(c). Consider
items a, b and the itemset ab in Fig. 1(a), which have
support of 6, 3 and 2 respectively, and are mapped to the
same generalized item (a, b) in Fig. 2(c). Observe that (a, b)
has a support of 7 that is equal to the sum of the supports
of a, b minus that of ab.
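
The identity of Theorem 3.1 can also be checked mechanically on a small example. The sketch below is an illustration under stated assumptions: `anonymize`, `sup`, the mapping `phi` and the dataset `toy` are hypothetical names and data, not those of Figs. 1(a) and 2(c). Generalized items are encoded as frozensets of original items.

```python
from typing import Dict, FrozenSet, Iterable, List


def sup(itemset: Iterable, dataset: List[FrozenSet]) -> int:
    """Number of transactions that contain every element of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in dataset if wanted <= t)


def anonymize(dataset: List[FrozenSet[str]],
              phi: Dict[str, FrozenSet[str]]) -> List[FrozenSet[FrozenSet[str]]]:
    """Replace each item by its generalized item; suppressed items (empty image) are dropped."""
    return [frozenset(phi[i] for i in t if phi[i]) for t in dataset]


# Hypothetical toy data: a and b are generalized together, c is left intact.
toy = [frozenset("ac"), frozenset("ab"), frozenset("b"), frozenset("c"), frozenset("abc")]
ab = frozenset("ab")
phi = {"a": ab, "b": ab, "c": frozenset("c")}
anon = anonymize(toy, phi)

lhs = sup([ab], anon)                                      # support of (a, b) after anonymization
rhs = sup("a", toy) + sup("b", toy) - sup("ab", toy)       # inclusion-exclusion on the original data
print(lhs, rhs)
```

Both sides count the transactions that contain a, b, or both, which is why subtracting the support of ab removes the double counting.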

3.2 Privacy Constraints 

The integration of privacy constraints is central to our 
framework because they allow for the explicit definition of 
which itemsets are potentially linkable and require protec- 
tion. In what follows, we formally define the notion of pri- 
vacy constraints and their satisfiability. 

Definition 3.2. (Privacy Constraint Set). A privacy
constraint p is a non-empty set of items in $\mathcal{I}$ that are
specified as potentially linkable. The union of all privacy
constraints formulates a privacy constraint set $\mathcal{P}$.



Definition 3.3. (Privacy Constraint Satisfiability).
Given a parameter k, a privacy constraint $p = \{i_1, \ldots, i_r\} \in \mathcal{P}$
is satisfied when the corresponding itemset
$\bigcup_{m=1}^{r} \Phi(i_m)$ is: (1) supported by at least k transactions in
$\tilde{\mathcal{D}}$, or (2) not supported in $\tilde{\mathcal{D}}$ and each of its proper subsets
is either supported by at least k transactions in $\tilde{\mathcal{D}}$ or not
supported in $\tilde{\mathcal{D}}$. $\mathcal{P}$ is satisfied when every $p \in \mathcal{P}$ is satisfied.

To illustrate these definitions, consider the privacy constraint
$p_1 = \{a, b, c\}$ in Fig. 1(b). This privacy constraint
is satisfied for k = 5 in the dataset of Fig. 2(c) because 5
transactions support the itemset $\Phi(a) \cup \Phi(b) \cup \Phi(c) =$
(a, b)c.
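
A direct, unoptimized reading of Definition 3.3 can be sketched as follows. The helper names (`anonymize`, `sup`, `is_satisfied`) and the toy data are assumptions for illustration; the enumeration of proper subsets is exponential in the size of the constraint and is only meant to mirror the definition, not to suggest how COAT tests satisfaction.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

GenItem = FrozenSet[str]


def anonymize(dataset: List[FrozenSet[str]], phi: Dict[str, GenItem]) -> List[FrozenSet[GenItem]]:
    """Replace each item by its generalized item; suppressed items are dropped."""
    return [frozenset(phi[i] for i in t if phi[i]) for t in dataset]


def sup(gen_itemset: Iterable[GenItem], anon: List[FrozenSet[GenItem]]) -> int:
    wanted = frozenset(gen_itemset)
    return sum(1 for t in anon if wanted <= t)


def is_satisfied(p: Iterable[str], phi: Dict[str, GenItem],
                 anon: List[FrozenSet[GenItem]], k: int) -> bool:
    """Definition 3.3: the image of p has support >= k, or it is unsupported and
    every non-empty proper subset has support >= k or 0."""
    image = frozenset(phi[i] for i in p if phi[i])
    s = sup(image, anon)
    if s >= k:
        return True            # condition (1)
    if s > 0:
        return False           # supported by fewer than k transactions
    for r in range(1, len(image)):
        for subset in combinations(image, r):
            if 0 < sup(subset, anon) < k:
                return False
    return True                # condition (2)


# Hypothetical toy data (not Fig. 1(a)): a and b are generalized to (a, b).
toy = [frozenset(t) for t in ("ac", "ab", "b", "c", "abc", "d", "d")]
ab = frozenset("ab")
phi = {"a": ab, "b": ab, "c": frozenset("c"), "d": frozenset("d")}
anon = anonymize(toy, phi)
print(is_satisfied("abc", phi, anon, k=2), is_satisfied("abc", phi, anon, k=3))
```

In this toy dataset the constraint {a, d} is an instance of condition (2) for k = 2: (a, b) and d never co-occur, and each of them alone appears in at least 2 transactions.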



Satisfying a privacy constraint p prevents identity disclo- 
sure because the number of transactions that can be linked 
to an individual using any subset of items in p is either at 
least k, or zero, as shown in Theorem 3.2.



Theorem 3.2. (Monotonicity). For a given k, the
satisfaction of a privacy constraint p in $\tilde{\mathcal{D}}$ implies that each
privacy constraint $p_j \subseteq p$ is satisfied in $\tilde{\mathcal{D}}$.

Proof. Assume that a privacy constraint $p_j \subseteq p$ is not
satisfied in $\tilde{\mathcal{D}}$ for this value of k. Then, according to Definition 3.3,
the satisfaction of p implies that the itemset
$I = \bigcup_{\forall i_m \in p} \Phi(i_m)$ is supported by either at least k or 0
transactions in $\tilde{\mathcal{D}}$. Now, consider an itemset $J = \bigcup_{\forall i_m \in p_j} \Phi(i_m)$,
which is derived by applying $\Phi$ on each item in $p_j$. Since
$J \subseteq I$, we have $sup(J, \tilde{\mathcal{D}}) \geq sup(I, \tilde{\mathcal{D}})$ when $sup(I, \tilde{\mathcal{D}}) \geq k$
due to the monotonicity principle [1]. When $sup(I, \tilde{\mathcal{D}}) = 0$,
$p_j$ is satisfied by Definition 3.3. In either case $p_j$ is satisfied
in $\tilde{\mathcal{D}}$ for the given k, which contradicts the assumption and
proves that the theorem holds true. □

Our privacy constraint specification model offers two ben- 
efits. First, it allows data owners to specify a range of dif- 
ferent privacy requirements. For instance, it can be used to 
protect specific itemsets of various sizes, or to provide the 
same privacy guarantees as $k^m$-anonymity (by formulating
a privacy constraint set that consists of all itemsets of size
m). Second, our model allows protecting any set of itemsets
without forcing any additional itemsets to be unnecessarily
protected. This is important because unnecessarily
protecting itemsets may significantly increase the amount of
information loss incurred to anonymize data.

3.3 Utility Constraints 

Privacy protection is offered at the expense of data utility
[11, 19], and so it is important to ensure that anonymized
data is not overly distorted. Existing approaches attempt 
to do so by minimizing the amount of information loss in- 
curred when anonymizing transactions [5][20], but do not 
guarantee furnishing a useful result for intended applica- 
tions. By contrast, our methodology offers such guaran- 
tees through the introduction of utility constraints. Be- 
fore formally defining such constraints, we make the fol- 
lowing important observations related to data usefulness. 

Observation 1. Mapping a set of items in $\mathcal{D}$ to the
same generalized item in the anonymized dataset $\tilde{\mathcal{D}}$ introduces
distortion because these items become indistinguishable
in $\tilde{\mathcal{D}}$. When there is no control of how specific
items are generalized, $\tilde{\mathcal{D}}$ may not be practically useful.

Observation 2. Suppressing an item in $\mathcal{D}$ introduces distortion
because this item is not contained in $\tilde{\mathcal{D}}$, and the amount
of distortion increases with the number of suppressions.

Based on these observations, we introduce a utility con- 
straint set U to limit the amount of generalization items are 
allowed to receive based on application requirements, and 
bound the number of items that can be suppressed using a 
threshold s. Definitions 3.4 and 3.5 define
a utility constraint set and its satisfiability, respectively.

Definition 3.4. (Utility Constraint Set). A utility
constraint set $\mathcal{U}$ is a partition of $\mathcal{I}$ that declares the set of
allowable mappings of the items from $\mathcal{I}$ to those of $\tilde{\mathcal{I}}$ through
$\Phi$. Each element of $\mathcal{U}$ is called a utility constraint.

Definition 3.5. (Utility Constraint Set Satisfiability).
Given sets $\mathcal{I}$, $\tilde{\mathcal{I}}$, a utility constraint set $\mathcal{U}$, and a
parameter s, $\mathcal{U}$ is satisfied if and only if (1) for each non-empty
$\tilde{i}_m \in \tilde{\mathcal{I}}$, $\exists u_j \in \mathcal{U}$ such that $\tilde{i}_m \subseteq u_j$, and (2) the fraction
of items in $\mathcal{I}$ contained in the set of suppressed items
S is at most s%.

The first condition limits the maximum amount of generalization
each item is allowed to receive in a set-based
anonymization $\tilde{\mathcal{I}}$, while the second condition ensures that
the number of suppressed items is controlled by a threshold
specified by data owners. When both of these conditions
hold, $\mathcal{U}$ is satisfied, and $\tilde{\mathcal{I}}$ corresponds to a dataset that can
be meaningfully analyzed. Example 2 illustrates the above
definitions.

Example 2. Consider Fig. 5, in which $\tilde{i}_1 \subseteq u_1$, $\tilde{i}_2 \subseteq u_2$,
and $\tilde{i}_4$, $\tilde{i}_5$, $\tilde{i}_6$ are subsets of $u_4$. The percent of suppressed
items is 12.5% because $\mathcal{I}$ consists of 8 items (see Fig. 4),
and only d is suppressed. Thus, $\mathcal{U}$ is satisfied.
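
The two conditions of Definition 3.5 translate into a short check. The sketch below is illustrative only: the function name, the concrete utility constraints and the mapping are assumptions chosen to be consistent with Example 2 (8 items, only d suppressed, three generalized items inside the last constraint), since Fig. 5 itself is not reproduced here.

```python
from typing import Dict, FrozenSet, List


def utility_constraints_satisfied(items: List[str],
                                  phi: Dict[str, FrozenSet[str]],
                                  utility: List[FrozenSet[str]],
                                  s_percent: float) -> bool:
    """Definition 3.5: (1) every non-empty generalized item lies inside one utility
    constraint, and (2) at most s% of the items are suppressed."""
    generalized = {g for g in phi.values() if g}
    if not all(any(g <= u for u in utility) for g in generalized):
        return False
    suppressed = [i for i in items if not phi[i]]
    return 100.0 * len(suppressed) / len(items) <= s_percent


items = sorted("abcdefgh")
# Assumed utility constraints (hypothetical stand-ins for u_1, ..., u_4).
utility = [frozenset("ab"), frozenset("c"), frozenset("d"), frozenset("efgh")]
phi = {
    "a": frozenset("ab"), "b": frozenset("ab"),   # a, b generalized together
    "c": frozenset("c"),
    "d": frozenset(),                             # d suppressed: 1 of 8 items = 12.5%
    "e": frozenset("ef"), "f": frozenset("ef"),
    "g": frozenset("g"), "h": frozenset("h"),
}
print(utility_constraints_satisfied(items, phi, utility, s_percent=15))
```

With a threshold below 12.5% the same mapping is rejected, and so is any mapping that generalizes items from two different utility constraints together.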



Figure 5: Utility Constraint Set Satisfiability example
for the set-based anonymization of Fig. 4.

We also observe that the number of the supporting transactions
of a generalized item in the anonymized dataset $\tilde{\mathcal{D}}$
is equal to the number of transactions supporting any item
in the original dataset $\mathcal{D}$ that is mapped to this generalized
item, as illustrated in Theorem 3.3.

Theorem 3.3. Given a generalized item $\tilde{i}_m \in \tilde{\mathcal{I}}$ such that
$\tilde{i}_m = (i_1, \ldots, i_r)$, it holds that

$|\tilde{\mathcal{D}}_{\tilde{i}_m}| = |\mathcal{D}_{i_1} \cup \ldots \cup \mathcal{D}_{i_r}|$

where $|\tilde{\mathcal{D}}_{\tilde{i}_m}|$ denotes the size of the set of supporting transactions
of $\tilde{i}_m$ in $\tilde{\mathcal{D}}$, and $|\mathcal{D}_{i_1} \cup \ldots \cup \mathcal{D}_{i_r}|$ the size of the
set of transactions supporting at least one of the items in
$\{i_1, \ldots, i_r\}$ in $\mathcal{D}$.

Proof. The proof is omitted, because it is similar to that
of Theorem 3.1. □



Based on Theorem 3.3, we provide the following corollary
that highlights the importance of anonymizing data while 
satisfying utility constraints. 

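Corollary 3.3. Given a utility constraint set $\mathcal{U}$ that is
satisfied, a utility constraint $u_j = \{i_1, \ldots, i_r\} \in \mathcal{U}$, and a set
of generalized items $\{\tilde{i}_1, \ldots, \tilde{i}_s\}$ constructed by mapping each
element of $u_j$ to one of these items, it holds that

$|\tilde{\mathcal{D}}_{\tilde{i}_1} \cup \ldots \cup \tilde{\mathcal{D}}_{\tilde{i}_s}| = |\mathcal{D}_{i_1} \cup \ldots \cup \mathcal{D}_{i_r}|$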



Thus, the number of transactions of $\mathcal{D}$ supporting any
item contained in a utility constraint $u_j \in \mathcal{U}$ can be accurately
computed from the anonymized dataset $\tilde{\mathcal{D}}$, when $\mathcal{U}$
is satisfied and all items in $u_j$ have been generalized. This
is crucial in many data analysis tasks (e.g., in generalized 
association rule mining [17]) where the support of itemsets 
corresponding to aggregate concepts (i.e., itemsets with a 
more general meaning than the items they are comprised 
of) needs to be determined, as illustrated below. 



Example 3. Consider that the dataset of Fig. 1(a) has
to be anonymized to support a study in which the number of
patients diagnosed with diabetes needs to be accurately computed.
Assume also that diagnosis codes a and b correspond
to two different forms of diabetes. Observe that the number
of patients suffering from diabetes (i.e., transactions having
a, b or ab) in the dataset of Fig. 1(a) is the same as in the
anonymization of this dataset shown in Fig. 2(c), because
this anonymization satisfies the utility constraints of Fig.
1(c) and both a and b in $u_1 = \{a, b\}$ have been generalized.



3.4 Information Loss 

There may be many anonymizations that satisfy the pri- 
vacy and utility constraint sets, but they may not be equally 
useful. Since discovering the one that least harms data util- 
ity is important, we propose a measure to capture data util- 
ity based on information loss. 

Definition 3.6. (Utility Loss for a Generalized
Item). The Utility Loss (UL) for a generalized item $\tilde{i}_m$ is
defined as

$UL(\tilde{i}_m) = \frac{2^{|\tilde{i}_m|} - 1}{2^{M} - 1} \times w(\tilde{i}_m) \times \frac{sup(\tilde{i}_m, \tilde{\mathcal{D}})}{N}$

where $|\tilde{i}_m|$ denotes the number of items from $\mathcal{I}$ mapped to $\tilde{i}_m$
using $\Phi$, and $w : \tilde{\mathcal{I}} \to [0, 1]$ is a function assigning a weight
to $\tilde{i}_m$.

UL measures the amount of information loss caused by
generalizing a set of items as a product of three terms. The
first term penalizes a generalized item based on the number
of items from $\mathcal{I}$ mapped to it. This is because a generalized
item can be interpreted as any of the $2^{|\tilde{i}_m|} - 1$ non-empty
subsets of the set of items mapped to it [5], and there are
$2^{M} - 1$ possible non-empty subsets that can be formed using
items from $\mathcal{I}$. The second term is a weight specified
by data owners to quantify the harm to data utility caused
by a generalized item, according to the items mapped to it.
Weights need to be between 0 and 1 for normalization purposes,
where larger weights are assigned to generalized items
comprised of items that are more semantically distant, since
such generalized items harm data utility more [20]. The semantic
distance of items can be computed in many ways
(e.g., based on the height of a hierarchy [IB], the number of
leaves of the closest common ascendant of these items in a
hierarchy [23], with the aid of ontologies [25], or by expert
knowledge). The third term is the support of a generalized
item in the anonymized dataset, normalized by the number
of transactions. Items that appear often in the anonymized
dataset are penalized more, since they introduce more data
distortion. Example 4 illustrates how UL can be computed.



Algorithm 1 COAT($\mathcal{D}$, $\mathcal{P}$, $\mathcal{U}$, k, s)



Example 4. Consider Fig. 1(a) and the anonymized
dataset of Fig. 2(c). Items a and b are generalized to (a, b),
which is assigned a weight of 0.375 specified by the data
owner, and has a support of 7 in Fig. 2(c). The UL for
(a, b) is computed as $\frac{2^2 - 1}{2^8 - 1} \times 0.375 \times \frac{7}{N} \approx 0.004$,
where N is the number of transactions in Fig. 1(a).
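
The computation in Example 4 can be reproduced with a small helper that follows Definition 3.6 term by term. This is a sketch under an explicit assumption: the number of transactions N of Fig. 1(a) is not stated in the text, so N = 8 is used here only because it is consistent with the reported value of about 0.004. The function names are hypothetical.

```python
def utility_loss(gen_size: int, num_items: int, weight: float,
                 support: int, num_transactions: int) -> float:
    """UL of one generalized item: size term x weight x normalized support."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weights are expected to lie in [0, 1]")
    size_term = (2 ** gen_size - 1) / (2 ** num_items - 1)
    return size_term * weight * (support / num_transactions)


def total_utility_loss(generalized_losses, suppression_penalties) -> float:
    """UL of a dataset: UL of all generalized items plus penalties of suppressed items."""
    return sum(generalized_losses) + sum(suppression_penalties)


# (a, b): 2 items out of M = 8, weight 0.375, support 7; N = 8 is an assumption.
ul_ab = utility_loss(gen_size=2, num_items=8, weight=0.375, support=7, num_transactions=8)
print(round(ul_ab, 3))
```

The size term grows exponentially with the number of items merged, so generalizing many items together is penalized far more heavily than pairing two items.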



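Based on Definition 3.6, we quantify the total amount of
information loss for an anonymized dataset $\tilde{\mathcal{D}}$ as follows.

Definition 3.7. (Utility Loss for an Anonymized
Dataset). The Utility Loss for an anonymized dataset $\tilde{\mathcal{D}}$ is
given by

$UL(\tilde{\mathcal{D}}) = \sum_{\forall \tilde{i}_m \in \tilde{\mathcal{I}}} UL(\tilde{i}_m) + \sum_{\forall i_m \in S} Y(i_m)$

where $Y : \mathcal{I} \to \mathbb{R}$ is a function assigning a penalty to each
suppressed item $i_m$ from $\mathcal{D}$.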

The above definition captures data utility loss caused by 
both generalization and suppression. Specifically, for sup- 
pression, similar to [5S], we allow data owners to assign a 
penalty to each suppressed item, according to the perceived 
importance of retaining this item in the anonymized result. 
For instance, each suppressed item could receive a penalty 
equal to its support, based on the fact that this defines the 
number of transactions from which it is eliminated. 

3.5 Problem Statement 

Given a transactional dataset $\mathcal{D}$, a privacy constraint set $\mathcal{P}$,
a utility constraint set $\mathcal{U}$, and parameters k, s, construct an
anonymized version $\tilde{\mathcal{D}}$ of $\mathcal{D}$ using the set-based anonymization
model such that: (1) $\mathcal{P}$ and $\mathcal{U}$ are both satisfied, and
(2) the amount of utility loss $UL(\tilde{\mathcal{D}})$ is minimal.

4. ANONYMIZATION ALGORITHM 

We now present COAT (COnstraint-based Anonymization
of Transactions), a heuristic algorithm that solves the
aforementioned problem using item generalization and suppression.
Given $\mathcal{D}$, $\mathcal{P}$, $\mathcal{U}$, k and s, COAT selects a privacy
constraint $p \in \mathcal{P}$, and applies item generalizations that are
specified by $\mathcal{U}$ and incur the smallest amount of information
loss to satisfy p. When p cannot be satisfied by generalization,
COAT suppresses the minimum number of items in p to
satisfy it. The process is repeated for all privacy constraints
until $\mathcal{P}$ is satisfied.

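Pseudocode for COAT is provided in Algorithm 1. Since
the anonymized dataset $\tilde{\mathcal{D}}$ is produced by transforming
items in transactions of the original dataset $\mathcal{D}$, in step 1,
we initialize $\tilde{\mathcal{D}}$ to $\mathcal{D}$. Steps 2 to 14 present the main iteration
of COAT, which aims to satisfy the privacy constraint
set $\mathcal{P}$ (step 2). In step 3, COAT selects the privacy constraint
p that is not satisfied and corresponds to an itemset
I having maximum support in $\tilde{\mathcal{D}}$, since satisfying this constraint
incurs minimal distortion of $\tilde{\mathcal{D}}$. This is due to the
fact that the minimum number of transactions in $\tilde{\mathcal{D}}$ have to
be distorted to augment the support of I to at least k.

1.  $\tilde{\mathcal{D}} \leftarrow \mathcal{D}$
2.  while $\mathcal{P}$ is not satisfied do
3.      find p corresponding to I s.t.
            $I \leftarrow \arg\max_{\forall p_j \in \mathcal{P} :\ p_j \text{ is not satisfied}} sup(\bigcup_{\forall \tilde{i}_r \in p_j} \tilde{i}_r, \tilde{\mathcal{D}})$
4.      while p is not satisfied $\wedge$ $|I| > 1$ do
5.          $\tilde{i}_m \leftarrow \arg\min_{\forall \tilde{i}_r \in I} sup(\tilde{i}_r, \tilde{\mathcal{D}})$
6.          $u_l \leftarrow \{u_j \in \mathcal{U} \mid \tilde{i}_m \in u_j\}$
7.          if $|u_l| > 1$
8.              Generalize($\tilde{i}_m$, $u_l$, $\tilde{\mathcal{D}}$)
9.          else if $sup(\tilde{i}_m, \tilde{\mathcal{D}}) < k$
10.             Suppress($\tilde{i}_m$, $u_l$, $\tilde{\mathcal{D}}$, s)
11.     if p is not satisfied $\wedge$ $|I| = 1$
12.         while p is not satisfied do
13.             $i_m \leftarrow \arg\min_{\forall i_r \in I} sup(i_r, \tilde{\mathcal{D}})$
14.             Suppress($i_m$, $u_l$, $\tilde{\mathcal{D}}$, s)
15. return $\tilde{\mathcal{D}}$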

Next, while p remains unsatisfied in $\tilde{\mathcal{D}}$ and p contains more
than one generalized item (step 4), COAT performs steps 5
to 10. In step 5, the item $\tilde{i}_m$ from p with the minimum
support in $\tilde{\mathcal{D}}$ is selected to be generalized. Selecting $\tilde{i}_m$
this way attempts to minimize the number of generalizations
required to satisfy p, as items with "low" support need to
be generalized to meet the specified k. Subsequently, we
identify the utility constraint $u_l$ from $\mathcal{U}$ for which $\tilde{i}_m \in u_l$, in
order to retrieve the items that are allowed to be generalized
with $\tilde{i}_m$ (step 6). If at least one item apart from $\tilde{i}_m$ is
contained in $u_l$ (step 7), item $\tilde{i}_m$ is generalized in a way
that minimizes information loss (step 8), as illustrated in
Algorithm 2. Otherwise, we suppress $\tilde{i}_m$ through a function
Suppress, given in Algorithm 3, since applying generalization
to increase the support of $\tilde{i}_m$ to k would result in violating
$u_l$ (steps 9-10).

Steps 11 to 14 aim to satisfy $p$ by suppressing the minimum number of
items in $\tilde{\mathcal{D}}$ required to satisfy this constraint. When
$I$ consists of one (generalized) item only, and $p$ is not satisfied
(step 11), we iteratively suppress items in $I$, starting with the one
having the minimum support, until $p$ is met (steps 12-14). Last,
$\tilde{\mathcal{D}}$ is released (step 15).
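The two selection rules used in steps 3 and 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes transactions and constraints are plain Python sets of item labels, and it assumes a simplified satisfaction test (an itemset is protected when its support is 0 or at least k). The function names are hypothetical.

```python
from typing import Iterable, List, Optional, Set

Transaction = Set[str]
Constraint = Set[str]


def support(itemset: Iterable[str], dataset: List[Transaction]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in dataset if items <= t)


def is_satisfied(p: Constraint, dataset: List[Transaction], k: int) -> bool:
    """Assumed test: the itemset of p is unsupported or has support >= k."""
    s = support(p, dataset)
    return s == 0 or s >= k


def pick_constraint(constraints: List[Constraint],
                    dataset: List[Transaction],
                    k: int) -> Optional[Constraint]:
    """Step 3: among unsatisfied constraints, pick the one whose itemset
    has maximum support (fewest transactions need to be distorted)."""
    unsatisfied = [p for p in constraints if not is_satisfied(p, dataset, k)]
    if not unsatisfied:
        return None
    return max(unsatisfied, key=lambda p: support(p, dataset))


def pick_item(p: Constraint, dataset: List[Transaction]) -> str:
    """Step 5: pick the item of p with minimum support.
    Ties are broken alphabetically so the result is deterministic."""
    return min(sorted(p), key=lambda i: support([i], dataset))
```

Breaking ties alphabetically is a choice made here for reproducibility; the paper breaks ties arbitrarily.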

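Algorithm 2 Generalize($i_m$, $u_l$, $\tilde{\mathcal{D}}$)

1. $i_s \leftarrow \arg\min_{\forall i_r \in u_l \setminus \{i_m\}} UL((i_m, i_r))$
2. $\tilde{i} \leftarrow (i_m, i_s)$
3. foreach $p \in \mathcal{P}$ : $i_m \in p \vee i_s \in p$
4.     $p \leftarrow (p \cup \{\tilde{i}\}) \setminus \{i_m, i_s\}$
5. $u_l \leftarrow (u_l \cup \{\tilde{i}\}) \setminus \{i_m, i_s\}$
6. Update transactions of $\tilde{\mathcal{D}}$ based on $\tilde{i}$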



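Algorithm 3 Suppress($i_m$, $u_l$, $\tilde{\mathcal{D}}$, $s$)

1. $u_l \leftarrow u_l \setminus \{i_m\}$
2. foreach $p \in \mathcal{P}$ : $i_m \in p$
3.     $p \leftarrow p \setminus \{i_m\}$
4. Remove $i_m$ from all transactions of $\tilde{\mathcal{D}}$
5. if more than $s\%$ of items are suppressed
6.     Error: $\mathcal{U}$ is violated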

Algorithms 2 and 3 indicate how COAT performs generalization and
suppression respectively. Each of these operations involves updating
the privacy and utility constraint sets $\mathcal{P}$ and $\mathcal{U}$,
as well as selected transactions of $\tilde{\mathcal{D}}$.

Specifically, Generalize (Algorithm 2) operates as follows. In step 1,
it identifies the item $i_s$ that can be generalized together with
$i_m$ in a way that incurs the least possible information loss
according to the UL measure. Step 2 performs the mapping of the two
items to a common generalized item $\tilde{i}$. Following that, steps
3-5 update the privacy and the utility constraints to reflect this
generalization. Finally, the transactions of $\tilde{\mathcal{D}}$ that
supported any of $i_m$, $i_s$ are updated to support the generalized
item $\tilde{i}$ instead.

Suppress (Algorithm 3) involves removing an item $i_m$ from the privacy
and utility constraint sets (steps 1-3), and from the transactions
supporting it in $\tilde{\mathcal{D}}$ (step 4). Finally, it checks
whether the imposed suppression threshold $s$ has been surpassed (step
5). This happens when utility constraints are overly restrictive (e.g.,
they require all items to remain intact in the anonymized dataset) and
a "low" suppression threshold is used. In this case, data owners are
notified that the utility constraint set $\mathcal{U}$ has been
violated and the anonymization process terminates (step 6).
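The two operations can be sketched in Python as below. This is an illustrative sketch under several assumptions that are not taken from the paper: every item is stored as a frozenset of the original items it covers, constraints and transactions are mutable sets of such items, the UL measure is passed in as a hypothetical `loss` callable, and the suppression threshold is interpreted as a percentage of the `n_items` original items.

```python
from typing import Callable, FrozenSet, List, Set

# A (possibly generalized) item, stored as the set of original items it covers.
Item = FrozenSet[str]


def generalize(i_m: Item, u_l: Set[Item], constraints: List[Set[Item]],
               dataset: List[Set[Item]],
               loss: Callable[[Item], float]) -> Item:
    """Merge i_m with the item of u_l whose merge has the lowest loss.
    The caller must ensure u_l holds at least one item besides i_m."""
    candidates = [i for i in u_l if i != i_m]
    # Step 1: cheapest partner (ties broken by label for determinism).
    i_s = min(candidates, key=lambda i: (loss(i_m | i), sorted(i)))
    merged = i_m | i_s                                  # step 2
    # Steps 3-6: constraints, u_l and transactions get the same update.
    for group in [*constraints, u_l, *dataset]:
        if i_m in group or i_s in group:
            group.difference_update({i_m, i_s})
            group.add(merged)
    return merged


def suppress(i_m: Item, u_l: Set[Item], constraints: List[Set[Item]],
             dataset: List[Set[Item]], suppressed: Set[str],
             n_items: int, s: float) -> None:
    """Remove i_m everywhere and enforce the suppression threshold s (%)."""
    for group in [*constraints, u_l, *dataset]:         # steps 1-4
        group.discard(i_m)
    suppressed.update(i_m)       # record the original items that were lost
    if 100.0 * len(suppressed) / n_items > s:           # steps 5-6
        raise ValueError("utility constraint set U is violated")
```

Storing a generalized item as the set of items it covers makes the merge in step 2 a plain set union, at the cost of losing the order in which items were merged.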

Example 5. We apply COAT on the dataset $\mathcal{D}$ of Fig. 1(a)
using the constraints of Figs. 1(b) and 1(c), $k = 5$, and $s = 15\%$.
Since $\mathcal{P}$ contains $p_1$, $p_2$ whose itemsets are equally
supported in $\tilde{\mathcal{D}}$, COAT arbitrarily considers
$p_1 = \{a, b, c\}$. Then, it selects $b$, which has the minimum
support among $a$, $b$ and $c$, and generalizes it together with $a$,
as required by $u_1$. This increases the support of $(a,b)c$ to 7,
satisfying $p_1$. Subsequently, COAT considers
$p_2 = \{d, e, f, g, h\}$. Item $d$ is minimally supported among the
items of $p_2$, thus it is considered for generalization. However, $d$
cannot be generalized due to $u_3$, and it is suppressed, since its
support is below $k$. After suppressing $d$, $p_2$ is still not met.
Since in $p_2$ both $g$ and $h$ have minimum support, $g$ is
arbitrarily selected to be generalized. Item $g$ can be generalized
with any of $e$, $f$ or $h$, but it is generalized with $h$, since
$(g, h)$ incurs the minimum information loss. This satisfies $p_2$, and
$\mathcal{P}$ is now satisfied. $\mathcal{U}$ is also satisfied, as
shown in an earlier example.

5. SPECIFYING PRIVACY AND UTILITY 
CONSTRAINTS 

The notions of privacy and utility constraints, which re- 
flect itemsets deemed as potentially linkable and important 
for intended data analysis tasks respectively, are central to 
our anonymization approach. Our constraint specification 
framework allows data owners to formulate detailed con- 
straints based on their specific privacy and utility require- 
ments, which are given as input to COAT. However, ac- 
knowledging that constraint specification may be challeng- 
ing for data owners who lack domain knowledge, we present 
simple methods that aim to help such data owners formulate 
constraints. 

Section 5.1 discusses our Privacy constraint set generation (Pgen)
algorithm that constructs a privacy constraint set automatically,
assuming that attackers can use any part of any transaction to link
published data to individuals. Pgen works by searching the original
dataset for itemsets with "low" support, each of which is treated as
potentially linkable and is modeled as a privacy constraint. Although
the resultant privacy constraint set corresponds to a stringent privacy
policy, we believe that adopting this policy is a safe choice when data
owners are unable to specify which items are potentially linkable.
Section 5.2 provides a recipe to reduce the effort of specifying
utility constraints.

5.1 Constructing a Privacy Constraint Set 

Before presenting Pgen, we capture the largest part of a transaction
that can be used in linking attacks using Definition 5.1.

Definition 5.1. (Maximal Infrequent Itemsets).
Given a transactional dataset $\mathcal{D}$, and a parameter $k$, we
define the set of maximal infrequent itemsets in $\mathcal{D}$ as those
itemsets that have a support in the interval $(0, k)$ in $\mathcal{D}$,
and none of their proper supersets is supported in $\mathcal{D}$.

Example 6 illustrates the above definition.

Example 6. Consider a dataset comprised of the last three transactions
of the dataset of Fig. 1(a) (associated with the itemsets $\{a, c, f\}$,
$\{a, c\}$ and $\{b, h\}$ respectively), and that $k$ is set to 2. The
lattice of itemsets in this dataset is illustrated in Fig. 6, in which
the support of each supported itemset is shown next to it. As can be
seen, the set of maximal infrequent itemsets in this dataset contains
only $acf$ and $bh$, as each of these itemsets is supported in the
dataset and all of its proper supersets have a support of zero.




Figure 6: An example of maximal infrequent item- 
sets 
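Definition 5.1 can be checked on the dataset of Example 6 with a brute-force enumeration. The sketch below is only meant to make the definition concrete: it enumerates every subset of every transaction, so it is exponential in transaction size and not a substitute for Pgen. The function name is hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def maximal_infrequent_itemsets(dataset: List[Set[str]],
                                k: int) -> List[FrozenSet[str]]:
    """Itemsets with support in (0, k) whose proper supersets are all
    unsupported (Definition 5.1), found by exhaustive enumeration."""
    # Every supported itemset is a non-empty subset of some transaction.
    supported = set()
    for t in dataset:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                supported.add(frozenset(combo))

    result = []
    for itemset in supported:
        sup = sum(1 for t in dataset if itemset <= t)
        if sup >= k:
            continue                      # frequent, needs no protection
        if any(itemset < other for other in supported):
            continue                      # a proper superset is supported
        result.append(itemset)
    return result


example6 = [{"a", "c", "f"}, {"a", "c"}, {"b", "h"}]
found = maximal_infrequent_itemsets(example6, k=2)
```

On the three transactions of Example 6 with k = 2, `found` contains exactly the itemsets acf and bh.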

Given a transactional dataset $\mathcal{D}$, and a parameter $k$, Pgen
constructs a privacy constraint set $\mathcal{P}$ that contains all the
maximal infrequent itemsets in $\mathcal{D}$. As mentioned above, the
generated $\mathcal{P}$ can be given as input to COAT to ensure that
anonymized data can prevent linking attacks based on any part of any
transaction in $\mathcal{D}$. The pseudocode of Pgen is provided in
Algorithm 4.



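Algorithm 4 Pgen($\mathcal{D}$, $k$)

1. $\mathcal{P} \leftarrow$ sorted transactions of $\mathcal{D}$ with respect to their size in decreasing order
2. foreach $T_r \in \mathcal{P}$, $r = 1, \ldots, N$
3.     foreach $T_s \in \mathcal{P}$, $s = (r + 1), \ldots, N$
4.         if $T_s \subseteq T_r$
               Remove $T_s$ from $\mathcal{P}$
5.     Itemset $I \leftarrow T_r$
6.     if $sup(I, \mathcal{D}) \geq k$
7.         Remove $T_r$ from $\mathcal{P}$
8. return $\mathcal{P}$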



Pgen starts by creating a privacy constraint set $\mathcal{P}$, which
is initialized by the set of transactions of the original dataset
$\mathcal{D}$, each of which is treated as a privacy constraint.
Clearly, this set may contain redundant itemsets, which would result in
an unnecessary computational overhead if used as input to COAT. This is
because COAT works by satisfying each privacy constraint iteratively.
Therefore, Pgen implements a simple pruning strategy that removes
redundant privacy constraints from $\mathcal{P}$ to reduce its size
without affecting the privacy guarantees provided when $\mathcal{P}$ is
satisfied.

The first step of this strategy is to populate $\mathcal{P}$ with the
set of transactions of $\mathcal{D}$, sorted in terms of decreasing
size. Subsequently, in steps 2-4, transactions ($T_s$) that are subsets
of other transactions ($T_r$) are identified and removed from
$\mathcal{P}$. This is because these transactions cannot correspond to
maximal infrequent itemsets, according to Definition 5.1. Next, steps 6
and 7 ensure that privacy constraints that do not require protection
(i.e., itemsets induced by transactions having a support of at least
$k$ in $\mathcal{D}$) are not included in $\mathcal{P}$. Finally,
$\mathcal{P}$, which contains the set of maximal infrequent itemsets in
$\mathcal{D}$, is returned in step 8. This privacy constraint set can
be given as input to COAT. Notice that Pgen has a quadratic run-time
complexity, as it involves sorting, pairwise comparison, and support
computation for transactions. To illustrate how Pgen works, we provide
Example 7.

Example 7. Consider applying Pgen on the dataset of Example 6 using
$k = 2$. This results in initializing $\mathcal{P}$ with three privacy
constraints $p_1 = \{a, c, f\}$, $p_2 = \{a, c\}$ and $p_3 = \{b, h\}$
(one for each transaction), which are sorted in terms of decreasing
size. Subsequently, $p_2$ is removed from $\mathcal{P}$, because
$\{a, c\}$ is a subset of $p_1 = \{a, c, f\}$. Next, Pgen checks the
support of $p_1$, and retains it in $\mathcal{P}$ as the support of
$p_1$ in this dataset is $1 \in (0, 2)$. In the final iteration, Pgen
examines $p_3$, and retains it in $\mathcal{P}$ for the same reason.
Thus, Pgen returns $\mathcal{P} = \{p_1, p_3\}$.
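A direct Python transcription of Algorithm 4 is sketched below and run on the data of Example 7. It assumes transactions are given as sets of item labels and follows the prose in removing itemsets whose support is at least k; the function name `pgen` and the list-based representation are choices made here for illustration.

```python
from typing import FrozenSet, List, Set


def pgen(dataset: List[Set[str]], k: int) -> List[FrozenSet[str]]:
    """Build a privacy constraint set from the transactions of `dataset`."""
    # Step 1: one candidate constraint per transaction, largest first.
    constraints = sorted((frozenset(t) for t in dataset),
                         key=len, reverse=True)
    r = 0
    while r < len(constraints):                       # step 2
        t_r = constraints[r]
        # Steps 3-4: drop later candidates that are subsets of T_r.
        constraints = constraints[:r + 1] + [
            t_s for t_s in constraints[r + 1:] if not t_s <= t_r
        ]
        # Steps 5-7: drop T_r itself if it is supported by >= k transactions.
        if sum(1 for t in dataset if t_r <= t) >= k:
            del constraints[r]
        else:
            r += 1
    return constraints                                # step 8


example7 = [{"a", "c", "f"}, {"a", "c"}, {"b", "h"}]
privacy_constraints = pgen(example7, k=2)
```

Because Python's sort is stable, the constraints of equal size keep their input order, so the run above visits the candidates in the same order as Example 7 and keeps {a, c, f} and {b, h}.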

5.2 Formulating a Utility Constraint Set 

While privacy constraints can be extracted automatically 
as discussed above, this is difficult for utility constraints, be- 
cause they model application-specific data analysis require- 
ments. Thus, we assume that data owners are able to specify 
utility constraints to avoid distorting itemsets that need to 
be used in intended applications. 

When interested in generating anonymized data that al- 
lows the counts of aggregate concepts to be accurately deter- 
mined, for example, data users can formulate a utility con- 
straint for each of these concepts (itemsets), as explained in 
Section 3.3. These itemsets may be selected with the help
of hierarchies or ontologies, which are specified by domain 
experts or constructed in an automated fashion P3] . A util- 
ity constraint containing the remaining items (i.e., those not 
contained in the selected itemsets) should also be specified 
to ensure that the utility constraint set is a partition of $\mathcal{I}$
(see Definition 3.4).
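The recipe above (one constraint per concept of interest, plus one constraint holding all remaining items) can be sketched as follows. The helper name and the validation checks are assumptions made for illustration; the paper only requires that the resulting set partitions the item domain.

```python
from typing import Iterable, List, Set


def build_utility_constraints(domain: Set[str],
                              selected: Iterable[Set[str]]) -> List[Set[str]]:
    """Return the selected itemsets plus a catch-all group with the
    remaining items, so that the groups partition `domain`."""
    groups = [set(g) for g in selected]
    covered: Set[str] = set()
    for g in groups:
        if not g <= domain:
            raise ValueError("constraint contains items outside the domain")
        if covered & g:
            raise ValueError("constraints overlap, so they cannot form a partition")
        covered |= g
    remainder = set(domain) - covered
    if remainder:
        groups.append(remainder)
    return groups


# Diabetes codes {a, b} must stay countable; everything else shares one group.
utility = build_utility_constraints(set("abcdefgh"), [{"a", "b"}])
```

With the eight diagnosis codes a to h and the single concept {a, b}, `utility` holds the two groups {a, b} and {c, d, e, f, g, h}.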

We emphasize that the way all items are generalized is governed by the
utility loss function (see Definition 3.6), which forces semantically
related items to be generalized together. Example 8 illustrates how
utility constraints may be specified.



Example 8. Consider that the dataset of Fig. 1(a) has to be anonymized
to support the study of Example 3, in which the number of patients
diagnosed with diabetes (i.e., transactions having $a$, $b$, or $ab$)
needs to be accurately computed. To support this study, the hospital
can specify a utility constraint $\{a, b\}$, and include all the
remaining diagnosis codes in a second constraint
$\{c, d, e, f, g, h\}$.

6. EXPERIMENTAL EVALUATION

In this section, we compare COAT to Apriori [20] using four series of
experiments. In the first series, we compare the amount of information
loss the algorithms incur to achieve $k^m$-anonymity. The second and
third series of experiments examine whether the algorithms can meet
detailed privacy and utility requirements without harming data utility,
and the last series evaluates their efficiency.

6.1 Experimental setup and metrics

We use two real-world transactional datasets, BMS-WebView-1 (BMS1) and
BMS-WebView-2 (BMS2), which contain click-stream data from two
e-commerce sites. The datasets have been used in evaluating prior work
[5, 20] and also as benchmarks in the 2000 KDD-Cup competition. Table 1
summarizes their characteristics.

Dataset    N        |I|     Max. |T|    Avg. |T|
BMS1       59602    497     267         2.5
BMS2       77512    3340    161         5.0

Table 1: Description of used datasets

To ensure a fair comparison between COAT and Apriori, we configured the
latter with the same hierarchies as in [20] and set the weights
$w(i_m)$ used in COAT based on a notion of semantic distance computed
according to the aforementioned hierarchies [24]. We did not compare
our approach to the two other algorithms proposed by the authors of
Apriori in [20]. This is because these algorithms have been shown to be
comparable to Apriori in terms of effectiveness, while they are only
applicable to datasets with a small domain of less than 50 items [20]
(typically, transactional datasets have a domain size in the order of
hundreds or thousands). We also did not compare our approach to those
of [25] and [5], since these approaches require a fixed categorization
of items into potentially linkable and sensitive, a classification that
is not applicable to the problem we tackle.

Both COAT and Apriori were implemented in C++. All 
experiments were performed on an Intel 2.8GHz machine 
equipped with 4GB of RAM. 

To quantify information loss, we considered aggregate query answering
as an indicative application, and measured the accuracy of answering
workloads of queries on anonymized data produced by the tested
algorithms. This is a widely-used approach to characterize information
loss [5, 9, 23] and is independent of the way tested algorithms work.
Consider the COUNT() query $Q$ shown in Fig. 7. We obtain an accurate
answer $a(Q)$ for $Q$ when this query is applied to original data
$\mathcal{D}$, but not in the case of generalized data
$\tilde{\mathcal{D}}$, as original items from $\mathcal{I}$ are mapped
to generalized ones in $\tilde{\mathcal{I}}$. Therefore, we can only
estimate the answer for $Q$.

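Q: SELECT COUNT($T_n$ (or $\tilde{T}_n$))
   FROM $\mathcal{D}$ (or $\tilde{\mathcal{D}}$)
   WHERE $i_1 \in T_n \wedge i_2 \in T_n \wedge \ldots \wedge i_q \in T_n$
   (or $\Phi(i_1) \in \tilde{T}_n \wedge \ldots \wedge \Phi(i_q) \in \tilde{T}_n$)

Figure 7: COUNT() query example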

This estimation can be performed by computing the probability that a
transaction of $\tilde{\mathcal{D}}$ satisfies $Q$, as
$\prod_{r=1}^{q} p(i_r)$, where $p(i_r)$ is the probability of mapping
an item $i_r$, $r = 1, \ldots, q$, in the query to a generalized item
$i_m$, assuming that $i_m$ can include any possible subset of the items
mapped to it with equal probability, and that there are no correlations
among generalized items [5, 9, 23]. An estimated answer $e(Q)$ of $Q$
is then derived by summing the corresponding probabilities across all
transactions $\tilde{T}_n$ of $\tilde{\mathcal{D}}$.
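The estimation can be sketched as follows. The paper does not spell out $p(i_r)$, so the sketch assumes one common reading of "any possible subset with equal probability": a generalized item covering $g$ original items stands for any of its $2^g - 1$ non-empty subsets, of which $2^{g-1}$ contain a given original item. The data layout (generalized items as frozensets, a `mapping` dictionary playing the role of the generalization function) is also an assumption.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Item = FrozenSet[str]  # generalized item = set of original items it covers


def membership_probability(generalized_item: Item) -> float:
    """Assumed p(i_r): chance that a uniformly chosen non-empty subset of
    the generalized item contains one particular original item."""
    g = len(generalized_item)
    return 2 ** (g - 1) / (2 ** g - 1)


def estimate_count(query: Iterable[str], dataset: List[Set[Item]],
                   mapping: Dict[str, Item]) -> float:
    """Estimated answer e(Q): sum over transactions of prod_r p(i_r)."""
    query = list(query)
    total = 0.0
    for t in dataset:
        prob = 1.0
        for item in query:
            gen = mapping.get(item)       # None if the item was suppressed
            if gen is None or gen not in t:
                prob = 0.0                # transaction cannot satisfy Q
                break
            prob *= membership_probability(gen)
        total += prob
    return total


ab, c = frozenset("ab"), frozenset("c")
anonymized = [{ab, c}, {ab}, {c}]
estimate = estimate_count(["a", "c"], anonymized,
                          {"a": ab, "b": ab, "c": c})
```

Here only the first transaction can satisfy the query, with probability 2/3 for $a$ and 1 for the ungeneralized $c$, so `estimate` is 2/3. Query items are treated as independent, in line with the no-correlation assumption stated above.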

To measure the accuracy of estimating $Q$, we use the Relative Error
(RE) measure computed as $RE(Q) = |a(Q) - e(Q)| / a(Q)$. Given a
workload of queries, the Average Relative Error (AvgRE) for all queries
reflects how well anonymized data supports query answering [9, 23]. To
measure AvgRE, we constructed workloads comprised of 1000 COUNT()
queries similar to $Q$. The items participating in these queries were
selected randomly from the generalized items.
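The two measures reduce to a few lines of Python. The sketch assumes a workload is given as (actual, estimated) answer pairs; rejecting queries with $a(Q) = 0$ is a choice made here, since RE is undefined for them and the paper does not say how such queries are handled.

```python
from typing import Iterable, Tuple


def relative_error(actual: float, estimated: float) -> float:
    """RE(Q) = |a(Q) - e(Q)| / a(Q)."""
    if actual == 0:
        raise ValueError("RE is undefined when the actual answer is 0")
    return abs(actual - estimated) / actual


def average_relative_error(answers: Iterable[Tuple[float, float]]) -> float:
    """AvgRE: mean RE over a workload of (actual, estimated) pairs."""
    errors = [relative_error(a, e) for a, e in answers]
    if not errors:
        raise ValueError("empty workload")
    return sum(errors) / len(errors)


# Two queries: actual 10 estimated as 8, actual 4 estimated as 5.
avg_re = average_relative_error([(10, 8), (4, 5)])
```

For the two-query workload above the individual errors are 0.2 and 0.25, giving an AvgRE of 0.225.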

6.2 Achieving $k^m$-anonymity

In this section, we empirically confirm that COAT not only satisfies
$k^m$-anonymity, but does so with up to 9 times less information loss
than Apriori. Specifically, we ran COAT by including all $m$-itemsets
in the privacy constraint set $\mathcal{P}$, considering a utility
constraint set $\mathcal{U}$ that contains all items (effectively
allowing all possible generalizations), and setting $s = 0.5\%$. Both
algorithms used the same $k$ and $m$ values. The results with respect
to the AvgRE and UL measures are summarized in Sections 6.2.1 and 6.2.2
respectively. In these experiments COAT did not suppress any items.
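A simple way to see which itemsets such a configuration has to protect is to count the support of every itemset of up to $m$ items that occurs in the data. The sketch below assumes the usual reading of $k^m$-anonymity (every combination of at most $m$ items that appears in the dataset is supported by at least $k$ transactions); function names are hypothetical, and the enumeration is only practical for small $m$.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Set


def itemset_supports(dataset: List[Set[str]], m: int) -> Counter:
    """Support of every itemset of size 1..m that occurs in the dataset."""
    counts: Counter = Counter()
    for t in dataset:
        items = sorted(t)
        for size in range(1, min(m, len(items)) + 1):
            counts.update(frozenset(c) for c in combinations(items, size))
    return counts


def violating_itemsets(dataset: List[Set[str]], k: int,
                       m: int) -> List[FrozenSet[str]]:
    """Itemsets of up to m items supported by fewer than k transactions."""
    return [i for i, c in itemset_supports(dataset, m).items() if c < k]


def is_km_anonymous(dataset: List[Set[str]], k: int, m: int) -> bool:
    return not violating_itemsets(dataset, k, m)


original = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
generalized = [{"a", "(b,c)"}, {"a", "(b,c)"}, {"a", "(b,c)"}]
```

With k = 2 and m = 2, `original` violates the requirement through {c} and {a, c}, while `generalized`, in which b and c are mapped to a single generalized item, satisfies it.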

6.2.1 Capturing data utility using AvgRE 

Figs. 8(a) and 8(b) report AvgRE scores for BMS1, where the number of
items $q$ included in a query was 1 and 3 respectively, $m$ was set to
2, and $k$ was selected over the range [2, 50]. As expected, increasing
$k$ induced more information loss due to the utility/privacy trade-off.
Increasing $q$ had a similar effect because accurately answering
queries involving many items is more difficult. COAT outperformed
Apriori in both cases, achieving up to 9 times better AvgRE scores.
This is because, as $k$ increases, the recoding model of Apriori forces
an increasingly large number of items to be generalized together, while
the model in COAT generalizes no more items than required to protect an
itemset. Similar results were achieved for BMS2 (omitted for brevity).



Figure 8: AvgRE vs. k for (a) q = 1 and (b) q = 3 in BMS1

We also executed COAT and Apriori using $k = 5$, and varied $m$ between
1 and 3. The AvgRE scores for BMS1 are shown in Fig. 9(a). Apriori
incurred 7 times more information loss than COAT to anonymize BMS1 when
$m = 3$. This is because the number of items that Apriori forces to be
generalized together to protect $m$-itemsets increases substantially as
$m$ grows. The impact of this generalization strategy on data utility
was even more evident in the case of BMS2, as shown in Fig. 9(b).

Figure 9: AvgRE vs. m for (a) BMS1 and (b) BMS2



6.2.2 Capturing data utility using UL 

We compared the two algorithms with respect to the UL measure. Fig.
10(a) shows the result of running these algorithms on BMS2 using
$m = 2$ and $k$ values between 2 and 50. Observe that Apriori was
fairly insensitive to $k$ up to 25. In fact, Apriori over-generalized
itemsets by increasing their support to much larger values than $k$ due
to its recoding strategy. On the other hand, COAT achieved a much
better result for all tested $k$ values, due to the fine-grained
generalization model it employs. We also examined how the algorithms
fared with respect to UL when $m$ varies between 1 and 3, and $k = 5$.
Observe that Apriori incurred substantially more information loss than
COAT for all tested $m$ values. This again suggests that the
generalization scheme of Apriori distorts data much more than our
set-based anonymization strategy. Similar results were obtained for
BMS1 (omitted for brevity).

We do not report additional results with respect to UL 
because COAT is designed to optimize this measure, and 
thus outperformed Apriori in all tested cases. 





Figure 10: (a) UL vs. k and (b) UL vs. m for BMS2 

6.3 Privacy constraints vs. data utility 

In this section, we experimentally confirm that our approach can
generate anonymizations with a high level of data utility through the
specification of detailed privacy constraints by data owners. The
impact of constraints generated by Pgen on data utility will be
examined in Section 7. We constructed two types of privacy policies to
simulate different privacy requirements, one in which itemsets that
require protection are all of the same size and comprised of certain
items from $\mathcal{I}$, and another in which such itemsets differ in
size. The utility constraint set $\mathcal{U}$ used in COAT was set as
in Section 6.2.



6.3.1 Protecting itemsets comprised of certain items 



We considered 5 privacy policies of the first type, PP1, ..., PP5, each
of which assumes that all 2-itemsets containing a certain percent of
randomly selected items require protection with $k = 5$. The mappings
between privacy policies and the percent of such items are as follows:
PP1 → 2%, PP2 → 5%, PP3 → 10%, PP4 → 25%, PP5 → 50%. These policies are
taken into account by COAT, but not by Apriori, which needs to protect
all 2-itemsets to satisfy them.

We first studied how privacy policies affect data utility, as captured
by AvgRE. Figs. 11(a) and 11(b) illustrate the results for $q = 1$ and
$q = 3$ respectively. As expected, because it avoids unnecessarily
protecting itemsets that are not specified by these policies, COAT
distorted data significantly less than Apriori. This is supported by
the AvgRE scores for COAT, which were significantly better than those
of Apriori. Furthermore, as policies become stricter (i.e., require
protecting itemsets induced by a larger percent of items from
$\mathcal{I}$), the AvgRE scores for COAT became slightly worse due to
the utility/privacy trade-off. Nevertheless, these scores remain
substantially better than those of Apriori in all cases. We repeated
the same experiments for BMS2, and obtained similar results, shown in
Figs. 11(c) and 11(d) respectively.

Figure 11: AvgRE vs. Privacy Policy (a) for q = 1 and (b) for q = 3 (in
BMS1), and (c) for q = 1 and (d) for q = 3 (in BMS2)

6.3.2 Protecting itemsets of varying size 

We simulated 4 privacy policies of the second type: PP6, ..., PP9. In
each of these policies, $\mathcal{P}$ consisted of itemsets with size 1
to 4, as shown in Table 2, and $k = 5$. To account for these policies,
Apriori had to protect all possible 4-itemsets, and thus it was
configured with $m = 4$.



Privacy    % of      % of          % of          % of
Policy     items     2-itemsets    3-itemsets    4-itemsets
PP6        33%       33%           33%           1%
PP7        30%       30%           30%           10%
PP8        25%       25%           25%           25%
PP9        16.7%     16.7%         16.7%         50%

Table 2: Summary of privacy policies PP6, ..., PP9



The AvgRE scores for BMS1 and BMS2, and a workload comprised of queries
with $q = 2$, are depicted in Figs. 12(a) and 12(b) respectively.
Notice that COAT achieved better AvgRE scores in both datasets,
permitting answers to queries up to 40 times more accurately than
Apriori. This is because COAT applies generalization to each privacy
constraint separately, thereby applying the minimum level of
generalization required to satisfy the specified constraint.




Figure 12: AvgRE vs. Privacy Policy for q = 2 (a) in BMS1 and (b) in
BMS2



6.4 Utility constraints vs. data utility 

The experiments reported in this section examine the effect of utility
constraints on data utility. We assumed 4 utility policies: UP1, ...,
UP4. Each policy contains groups of a certain number of semantically
close items (i.e., sibling items in the hierarchy). The mappings
between utility policies and the size of these groups are as follows:
UP1 → 25, UP2 → 50, UP3 → 250, and UP4 → 500. Items in each group are
allowed to be generalized together. Note that UP1 and UP2, which have
smaller group sizes, are very stringent and may require suppression to
be satisfied. For this reason, we configured COAT with a small
suppression threshold $s$ of 0.5%. Apriori does not address these
policies because item generalization is not guided by utility
constraints. Also, the privacy constraint set $\mathcal{P}$ included
all 2-itemsets and Apriori was run with $m = 2$.

AvgRE scores for a workload of queries with $q = 1$ and $q = 3$ are
shown in Figs. 13(a) and 13(b) respectively, for BMS1. Observe that
COAT significantly outperformed Apriori for all utility policies.
Furthermore, the number of suppressed items was very small (0.01%) and
suppression occurred only in the case of UP1. This illustrates the
effectiveness of COAT, which suppresses the minimum number of items
required, and only when utility constraints cannot be otherwise met. We
also note that COAT was able to satisfy the imposed utility policies in
all cases, unlike Apriori, which was unable to meet any of them.
Interestingly, the AvgRE scores for COAT were not substantially
affected by utility policies. This is because COAT applied a much lower
level of generalization than that specified by the utility constraints.
Similar trends were observed for BMS2 (omitted for brevity).

Figure 13: AvgRE vs. Utility Policy (a) for q = 1 and (b) for q = 3 (in
BMS1)



6.5 Efficiency of Computation 

We compared COAT and Apriori in terms of efficiency. We first examined
the scalability of these algorithms with respect to dataset
cardinality, by applying them on a dataset constructed by randomly
selecting transactions of BMS1. COAT was configured by setting
$\mathcal{P}$ and $\mathcal{U}$ as in Section 6.2, with $m = 2$ and
$k = 5$. Apriori was run with the same $k$ and $m$ values. Fig. 14(a)
reports run-time as cardinality varies from 1K to 50K transactions.
COAT scales better than Apriori with the size of the dataset, running
up to 2.5 times faster. This is because COAT prunes the space by
discarding protected itemsets as cardinality increases, whereas Apriori
considers all $m$-itemsets as well as their possible generalizations.





Figure 14: Efficiency vs. (a) dataset size $|\mathcal{D}|$ and (b) k

Last, we evaluated the impact of $k$ on the run-time of COAT and
Apriori on BMS1. We used $k$ values between 2 and 50, and set up all
other parameters as in the previous experiment. As can be seen in Fig.
14(b), COAT is slightly less efficient than Apriori. This is due to the
fact that COAT generalizes one item at a time, exploiting the
flexibility of the set-based anonymization model. By comparison,
Apriori generalizes entire subtrees of items and thus reaches the
specified $k$ faster. Nevertheless, the computation cost of COAT was
less than half a minute, remaining sub-linear for all tested values of
$k$.

7. CASE STUDY: DIAGNOSIS CODES 

In this section, we examine whether COAT can produce 
anonymized data that permits accurate analysis in a real- 
world scenario involving detailed, application-specific util- 
ity requirements. In this context, a transactional dataset 
(referred to as EMR) derived from the Electronic Medical 
Record system of the Vanderbilt University Medical Cen- 
ter [18] needs to be published to enable certain biomedical 
studies. Each transaction of EMR corresponds to a distinct
patient, and contains his/her diagnosis codes in the form of
ICD-9 codes (see footnote 1). Table 3 summarizes the characteristics of
EMR.



Dataset    N       |I|     Max. |T|    Avg. |T|
EMR        1336    5830    25          3.1

Table 3: Description of the EMR dataset.

The studies that anonymized data needs to support focus on 20 different
disorders, each of which is modeled as a set of ICD-9 codes. For
instance, pancreatic cancer is represented as a set of 7 ICD-9 codes,
which correspond to different forms of pancreatic cancer and indicate
that a patient suffers from this disorder. To support these studies,
the number of patients suffering from each of these disorders needs to
be accurately computed. At the same time, the linkage of transactions
to patients' identities based on any combination of ICD-9 codes must be
prevented, because the vast majority of ICD-9 codes contained in EMR
can be found in other sources, as verified in our previous study [10].

To achieve both privacy and utility, we used our Pgen algorithm to
construct a privacy constraint set, and formulated a utility constraint
set comprised of 20 utility constraints, each for a different disorder
(e.g., we specified a utility constraint that contains the 7 ICD-9
codes corresponding to pancreatic cancer). Furthermore, we configured
COAT by setting the weights $w(i_m)$ used in it based on a notion of
semantic similarity [24] computed according to the hierarchy for ICD-9
codes (see footnote 2), and limited the maximum allowable fraction of
suppressed items by setting $s$ to 0.5%. Apriori was also applied to
anonymize EMR, although it provides no guarantees that utility
constraints are satisfied.

We evaluated the utility of anonymizations produced by 
both COAT and Apriori in two ways. First, we examined 
whether the produced anonymizations satisfied the specified 
utility constraint set. In fact, anonymizations constructed 
by COAT satisfy the latter set for all tested k values (namely 
2, 5, 10, 25 and 50). Thus, COAT managed to generate prac- 
tically useful anonymizations that allow the number of pa- 
tients having any of the 20 disorders used in the intended 
studies to be accurately computed (see Corollary 3.3). On
the other hand, the anonymizations constructed by the Apri- 
ori algorithm did not satisfy the specified utility constraint 
set for any of the tested k values. Therefore, we did not eval- 
uate the data utility of anonymizations produced by Apriori 
using other criteria. 

In addition to satisfying the specified utility constraints, it is also
important to generate anonymized data with "low" information loss that
can support general data analysis tasks. Therefore, we investigated
whether our method can generate anonymizations that are useful in
aggregate query answering. To capture the amount of information loss,
we used the AvgRE measure, discussed in Section 6.1. AvgRE was computed
using two different workloads referred to as W1 and W2 respectively. W1
is comprised of COUNT() queries that retrieve combinations of ICD-9
codes supported by at least 10% of the transactions of EMR. These
combinations correspond to frequently co-occurring disorders (e.g.,
diabetes and hypertension) that are important in the context of
biomedical data analysis, and are different from the 20 disorders
contained in the utility constraint set. W2 is similar to the workload
considered in Section 6.1. It is comprised of 1000 COUNT() queries
similar to the query shown in Fig. 7, each of which is comprised of 2
ICD-9 codes randomly selected among generalized items. This workload
models a scenario involving anonymized data queried by users with
various data analysis requirements.

Fig. 15(a) reports the AvgRE scores for EMR, where $k$ was selected
over the range [2, 50], and W1 was used. As can be seen, the AvgRE
scores indicate that anonymized data permits queries that are common in
biomedical data analysis tasks to be answered fairly accurately, even
when a strict privacy policy is adopted. The corresponding result for
W2 is reported in Fig. 15(b). Again, the AvgRE scores confirm that a
low level of information loss was incurred to anonymize EMR,
particularly when $k$ is 5 or lower, as is commonly the case when
publishing biomedical data [3].

Figure 15: AvgRE vs. k for EMR-D computed using (a) W1 and (b) W2

1 ICD-9 is the official system of assigning codes to diagnoses in the
U.S.

2 http://www.cdc.gov/nchs/icd/icd9cm.htm

In summary, our case study confirms the effectiveness of 
our anonymization framework when there are specific utility 
requirements. This is because it allows the EMR dataset to 
be published in a way that prevents linking attacks with 
respect to any portion of any transaction of this dataset, 
helps biomedical studies focusing on specific disorders, and 
allows accurate data analysis. 

8. CONCLUSIONS AND FUTURE WORK 

Existing approaches for anonymizing transactional data 
often produce excessively distorted data that is of limited 
utility, due to the fact that they incorporate coarse pri- 
vacy requirements, are agnostic with respect to data utility 
requirements, and search a fraction of the solution space. 
In response, we developed a novel approach that overcomes 
these limitations by allowing fine-grained privacy and util- 
ity requirements to be specified as constraints, and COAT 
(COnstraint-based Anonymization of Transactions), an al- 
gorithm that transforms data using item generalization and 
suppression to satisfy the specified constraints, while mini- 
mally distorting data. We also demonstrated the effective- 
ness of our approach using extensive experiments on bench- 
mark datasets and a case study on patient-specific data 
containing diagnosis codes. Our results demonstrate that 
COAT is able to satisfy a wide range of privacy and utility 
requirements with less information loss than the state-of-the-art
method, and to anonymize data in a way that prevents
identity disclosure and retains data utility for intended ap- 
plications. 

This work also opens up several directions for future inves- 
tigation. First, although experimentally shown to be both 
effective and efficient in practice, the COAT algorithm is 
heuristic in nature, and, as such, it does not guarantee gen- 
erating optimal anonymizations in terms of minimum infor- 
mation loss. To address the growing size of datasets and do- 
mains, we intend to develop approximation algorithms that 
can offer such guarantees. Second, we aim at extending our 
framework to deal with the problem of attribute disclosure 
based on the $\ell$-diversity privacy principle [11],